16 research outputs found

    MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP

    We present MultiVec, a new toolkit for computing continuous representations for text at different granularity levels (word-level or sequences of words). MultiVec includes Mikolov et al. [2013b]'s word2vec features, Le and Mikolov [2014]'s paragraph vector (batch and online) and Luong et al. [2015]'s model for bilingual distributed representations. MultiVec also includes different distance measures between words and sequences of words. The toolkit is written in C++ and is aimed at being fast (in the same order of magnitude as word2vec), easy to use, and easy to extend. It has been evaluated on several NLP tasks: the analogical reasoning task, sentiment analysis, and crosslingual document classification.

    Word2Vec vs DBnary, or how to reconcile distributed representations and lexical-semantic networks? The case of machine translation evaluation

    This paper presents an approach that combines lexical-semantic resources and distributed representations of words for the evaluation of machine translation (MT). The study enriches a well-known MT evaluation metric, METEOR, which allows an approximate match (synonymy or morphological similarity) between an automatic translation and a reference translation. Our experiments are carried out in the framework of the Metrics task of WMT 2014. We show that distributed representations are less effective than lexical-semantic resources for MT evaluation, but that they can nonetheless bring interesting additional information.

    Better Evaluation of ASR in Speech Translation Context Using Word Embeddings

    This paper investigates the evaluation of ASR in a spoken language translation (SLT) context. More precisely, we propose a simple extension of the WER metric that uses word embeddings to penalize substitution errors differently according to their context. For instance, the proposed metric should catch near matches (mainly morphological variants) and penalize them less, since this kind of error has a more limited impact on translation performance. Our experiments show that the new metric correlates better with SLT performance than WER does. Oracle experiments are also conducted and show the ability of our metric to find better hypotheses (to be translated) in the ASR N-best list. Finally, a preliminary experiment in which ASR tuning is based on our new metric shows encouraging results. For reproducible experiments, the code for calling our modified WER and the corpora used are made available to the research community.
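
The idea of penalizing substitution errors according to embedding similarity can be illustrated with a minimal sketch. This is not the authors' released code or their exact formula: it assumes the substitution cost is one minus the cosine similarity of the two word vectors (clipped to the range 0 to 1), and the function names and the toy embedding dictionary are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def substitution_cost(ref_word, hyp_word, embeddings):
    """Assumed cost: 0 for identical words, 1 - cosine similarity otherwise.

    Words missing from the embedding table fall back to the standard WER cost of 1.
    """
    if ref_word == hyp_word:
        return 0.0
    if ref_word not in embeddings or hyp_word not in embeddings:
        return 1.0
    similarity = cosine_similarity(embeddings[ref_word], embeddings[hyp_word])
    return min(1.0, max(0.0, 1.0 - similarity))


def embedding_weighted_wer(reference, hypothesis, embeddings):
    """Edit distance with soft substitution costs, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = cheapest way to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = float(i)  # i deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = float(j)  # j insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,  # deletion
                dp[i][j - 1] + 1.0,  # insertion
                dp[i - 1][j - 1]
                + substitution_cost(ref[i - 1], hyp[j - 1], embeddings),
            )
    return dp[len(ref)][len(hyp)] / len(ref)


# Toy, hand-made vectors for illustration only (not trained embeddings).
toy_embeddings = {"cat": [1.0, 0.0], "cats": [1.0, 0.1], "dog": [0.0, 1.0]}
near_match = embedding_weighted_wer("the cat sleeps", "the cats sleeps", toy_embeddings)
far_match = embedding_weighted_wer("the cat sleeps", "the dog sleeps", toy_embeddings)
```

With an empty embedding table every substitution costs 1, so the function reduces to ordinary WER. In the toy example the morphological variant "cats" is penalized far less than the unrelated word "dog".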

    Word2Vec vs DBnary: Augmenting METEOR using Vector Representations or Lexical Resources?

    This paper presents an approach that combines lexico-semantic resources and distributed representations of words for the evaluation of machine translation (MT). The study enriches a well-known MT evaluation metric, METEOR, which allows an approximate match (synonymy or morphological similarity) between an automatic translation and a reference translation. Our experiments are carried out in the framework of the Metrics task of WMT 2014. We show that distributed representations are a good alternative to lexico-semantic resources for MT evaluation and that they can even bring interesting additional information. The augmented versions of METEOR, using vector representations, are made available on our Github page.

    Distribution maps of twenty-four Mediterranean and European ecologically and economically important forest tree species compiled from historical data collections

    Species distribution maps are often lacking for scientific investigation and strategic management planning at international level. Here, we present the range-wide, natural distribution maps of twenty-four Mediterranean and European forest-tree species of key ecological and economic importance in the Mediterranean basin. Data on the geographic distribution of the twenty-four tree species were compiled from over one hundred published sources, making this contribution one of the most extensive resources available from historical data. The dataset can be accessed at: https://doi.org/10.5281/zenodo.822953. Associated metadata can be accessed at: http://www.fao.org/geonetwork/srv/en/metadata.show?id=56996. These data provide key spatial information to further investigate species occurrence-environment relationships, provide a baseline to assess the future impact of climate change, and identify marginal populations with specific genetic resources, among other possible applications.

    An Open Source Toolkit for Word-level Confidence Estimation in Machine Translation

    Recently, a growing need for Confidence Estimation (CE) for Statistical Machine Translation (SMT) systems in Computer Aided Translation (CAT) has been observed. However, most CE toolkits are optimized for a single target language (mainly English) and, as far as we know, none of them is both dedicated to this specific task and freely available. This paper presents an open-source toolkit for predicting the quality of words in an SMT output, whose novel contributions are (i) support for various target languages and (ii) handling of a number of features of different types (system-based, lexical, syntactic and semantic). In addition, the toolkit integrates a wide variety of Natural Language Processing and Machine Learning tools to pre-process data, extract features and estimate confidence at word level. Features for Word-level Confidence Estimation (WCE) can be easily added or removed using a configuration file. We validate the toolkit by experimenting in the WCE evaluation framework of the WMT shared task with two language pairs: French-English and English-Spanish. The toolkit is made available to the research community with ready-made scripts to launch full experiments on these language pairs, while achieving state-of-the-art and reproducible performance.

    Delivering tree genetic resources in forest and landscape restoration. A guide to ensuring local and global impact

    In the last 25 years, almost 50 million hectares of primary forest have been lost due to deforestation. Numerous international initiatives such as the Bonn Challenge and the New York Declaration on Forests have set ambitious goals to restore degraded and deforested lands by 2030. Realizing global commitments on forest and landscape restoration (FLR) will require the establishment of billions of trees on millions of hectares of degraded land to address the triple crisis of biodiversity loss, climate change and failing food systems. A significant amount of FLR will require tree planting or increasing tree cover in production landscapes. The scaled delivery of tree genetic resources (TGR), in other words, the diversity of species and genotypes from seeds and other forest reproductive material, will be critical to achieving impact. Lack of available forest reproductive material undermines the scaling of FLR and its potential to deliver expected benefits. Achieving impact from FLR requires an abundance of seeds and seedlings from many species and sources of genetic diversity within species. The aim of this working paper is to highlight key challenges and opportunities for the integration of TGR – from genes and species to landscapes – in current FLR projects. We first explore why TGR are so important and identify the key role that they play in supporting biodiversity, mitigating and adapting to climate change, and enhancing resilient livelihoods. Second, we evaluate the challenges and barriers to scaling the use of TGR in restoration, and how these undermine the potential of FLR to deliver expected benefits. Third, we review recent opportunities and innovations in the latest literature for mainstreaming TGR in FLR and present 13 case studies from around the world, representing state-of-the-art and best practices for TGR conservation and use. 
We then summarize the findings from these case studies, covering a range of topics from improved in situ and ex situ conservation of TGR, strategies for ensuring high-quality, diverse planting material, and evidence to show how increasing the use of TGR in FLR can increase benefits, locally and globally. We provide practical guidelines for improving integration of TGR in FLR, for consideration by a wide range of stakeholders, in particular: (1) countries and national policymakers; (2) donors and funding bodies; (3) international organizations and regional networks; and (4) restoration practitioners. Finally, we present a list of eight key recommendations to support the delivery of TGR for maximizing restoration outcomes towards reversing biodiversity loss, mitigating and adapting to climate change, and supporting sustainable food systems and improved livelihoods. We hope that this paper will contribute to achieving the United Nations Sustainable Development Goals (SDGs), specifically SDGs 1 (No Poverty), 2 (Zero Hunger), 13 (Climate Action) and 15 (Life on Land) through greater and more effective use of TGR in FLR implementation.